We introduce and study the problem of computing the similarity self-join in astreaming context (SSSJ), where the input is an unbounded stream of itemsarriving continuously. The goal is to find all pairs of items in the streamwhose similarity is greater than a given threshold. The simplest formulation ofthe problem requires unbounded memory, and thus, it is intractable. To make theproblem feasible, we introduce the notion of time-dependent similarity: thesimilarity of two items decreases with the difference in their arrival time. Byleveraging the properties of this time-dependent similarity function, we designtwo algorithmic frameworks to solve the sssj problem. The first one, MiniBatch(MB), uses existing index-based filtering techniques for the static version ofthe problem, and combines them in a pipeline. The second framework, Streaming(STR), adds time filtering to the existing indexes, and integrates newtime-based bounds deeply in the working of the algorithms. We also introduce anew indexing technique (L2), which is based on an existing state-of-the-artindexing technique (L2AP), but is optimized for the streaming case. Extensiveexperiments show that the STR algorithm, when instantiated with the L2 index,is the most scalable option across a wide array of datasets and parameters.
展开▼